This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The task: build a model that accurately predicts whether the patients in the dataset have diabetes.
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1); 268 of the 768 observations are 1, the rest are 0
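With 268 positives out of 768 rows, the classes are imbalanced. The baseline rates implied by the description can be sketched with plain arithmetic (no dataset required):

```python
# Class balance implied by the description: 268 of 768 rows have Outcome == 1.
positives, total = 268, 768
pos_rate = positives / total
neg_rate = (total - positives) / total
print(f"positive rate: {pos_rate:.3f}")  # ~0.349
print(f"negative rate: {neg_rate:.3f}")  # ~0.651
# A majority-class ("always predict 0") baseline would already score ~65%
# accuracy, which is why accuracy alone is a poor metric here.
```

This is why metrics such as ROC AUC are used for model selection later in the notebook.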
#!pip install lightgbm
# Data Wrangling and Data Analysis
import pandas as pd
import numpy as np
# Visualization
from matplotlib import pyplot as plt, style
import matplotlib
import seaborn as sns
import plotly
# Feature Engineering / Feature Selection
from sklearn.feature_selection import VarianceThreshold
from sklearn import model_selection
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import scipy.stats as stats
from sklearn import base
import optuna
from optuna.integration import LightGBMPruningCallback
from optuna import Trial, visualization
from functools import partial
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler
from optuna.integration import OptunaSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Model Building
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn import metrics
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv("health care diabetes.csv")
df.head(5)
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
2.1 Let's Check the Shape of the Data
df.shape
(768, 9)
2.2 Let's Check Summary Statistics of the Data
df.describe()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
Interpretation :
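The summary statistics show minima of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, which is physiologically impossible: those zeros are really missing values. Counting such zeros can be sketched as follows (using a tiny synthetic frame as a stand-in, since the real CSV may not be at hand):

```python
import pandas as pd

# Tiny synthetic stand-in for the real data; zeros mark missing readings.
toy = pd.DataFrame({
    "Glucose":       [148, 85, 183, 0, 137],
    "BloodPressure": [72, 66, 64, 0, 0],
    "BMI":           [33.6, 26.6, 0.0, 28.1, 43.1],
})
zero_counts = (toy == 0).sum()  # zeros per column
print(zero_counts)
```

On the real dataframe the same `(df[cols] == 0).sum()` pattern reveals the counts that motivate the imputation in section 2.4.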
2.3 Let's Check the Distribution of the Data
style.use('classic')
a = 3 # number of rows
b = 3 # number of columns
c = 1 # initialize plot counter
plt.figure(figsize=(30,20))
for feature in df.columns:
plt.subplot(a, b, c)
sns.kdeplot(x = df[feature], fill=True, color="red").set_title(f'{feature} Distribution ',fontsize=15)
c = c + 1
plt.tight_layout(pad = 4.0)
plt.show()
Interpretation :
2.4 Missing Value Imputation
# Replacing Invalid Zeros with NaN Value For Imputation
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.nan)
# Lets Check for Missing Values
pd.DataFrame({"Missing Values":df.isna().sum(),"Percentage":np.round(100*df.isna().sum()/df.shape[0],3)})
|   | Missing Values | Percentage |
|---|---|---|
| Pregnancies | 0 | 0.000 |
| Glucose | 5 | 0.651 |
| BloodPressure | 35 | 4.557 |
| SkinThickness | 227 | 29.557 |
| Insulin | 374 | 48.698 |
| BMI | 11 | 1.432 |
| DiabetesPedigreeFunction | 0 | 0.000 |
| Age | 0 | 0.000 |
| Outcome | 0 | 0.000 |
# Lets Impute The Missing Values Using KNN Imputer
knn_imputer= KNNImputer(n_neighbors=5)
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = knn_imputer.fit_transform(df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']])
# Checking for Missing Values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    float64
 2   BloodPressure             768 non-null    float64
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(6), int64(3)
memory usage: 54.1 KB
# Save the cleaned Data Into CSV For Visualization
df.to_csv("Cleaned_Health_Care_Data.csv", index=False)
2.5 Looking For Outliers
style.use('ggplot')
a = 3 # number of rows
b = 3 # number of columns
c = 1 # initialize plot counter
plt.figure(figsize=(30,20))
for feature in df.columns:
plt.subplot(a, b, c)
sns.boxplot(x = df[feature]).set_title(f'{feature} ',fontsize=15)
c = c + 1
plt.tight_layout(pad = 4.0)
plt.show()
Interpretation :
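The box plots flag points beyond the 1.5×IQR whiskers. That rule can be made explicit with a small sketch (on a toy series, not the project data):

```python
import pandas as pd

s = pd.Series([23, 25, 27, 28, 30, 31, 33, 120])  # 120 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # box-plot whisker bounds
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [120]
```

Applying the same bounds per feature on `df` would give a count of flagged points per column.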
2.6 Plot Describing the Data Types and the Count of Variables
df.dtypes.value_counts().plot(kind='bar')
plt.show()
Interpretation :
2.7 Let's Visualize Correlation Between Each Feature
style.use('ggplot')
a = 4 # number of rows
b = 2 # number of columns
c = 1 # initialize plot counter
plt.figure(figsize=(30,30))
for i in range(0, df.columns.nunique()-1):
plt.subplot(a, b, c)
sns.scatterplot(x = df.iloc[:,i], y =df.iloc[:,i+1] ).set_title(f'Correlation of {df.iloc[:,i].name} with {df.iloc[:,i+1].name}',fontsize=20)
c = c + 1
plt.tight_layout(pad = 4.0)
plt.show()
Interpretation :
2.8 Correlation Analysis Via Heat Map
# Using Spearman Correlation as Data Contains Outliers
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(method="spearman"), annot=True)
plt.title("Correlation Heat Map")
plt.show()
df.corr().style.background_gradient(cmap='viridis')
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.000000 | 0.126862 | 0.215428 | 0.094821 | 0.049211 | 0.022979 | -0.033523 | 0.544341 | 0.221898 |
| Glucose | 0.126862 | 1.000000 | 0.231934 | 0.246306 | 0.633321 | 0.233029 | 0.136860 | 0.265111 | 0.492195 |
| BloodPressure | 0.215428 | 0.231934 | 1.000000 | 0.240791 | 0.112299 | 0.297395 | -0.000262 | 0.326810 | 0.177829 |
| SkinThickness | 0.094821 | 0.246306 | 0.240791 | 1.000000 | 0.270635 | 0.658912 | 0.119837 | 0.121907 | 0.276664 |
| Insulin | 0.049211 | 0.633321 | 0.112299 | 0.270635 | 1.000000 | 0.296331 | 0.131794 | 0.153443 | 0.336613 |
| BMI | 0.022979 | 0.233029 | 0.297395 | 0.658912 | 0.296331 | 1.000000 | 0.152134 | 0.030702 | 0.312216 |
| DiabetesPedigreeFunction | -0.033523 | 0.136860 | -0.000262 | 0.119837 | 0.131794 | 0.152134 | 1.000000 | 0.033561 | 0.173844 |
| Age | 0.544341 | 0.265111 | 0.326810 | 0.121907 | 0.153443 | 0.030702 | 0.033561 | 1.000000 | 0.238356 |
| Outcome | 0.221898 | 0.492195 | 0.177829 | 0.276664 | 0.336613 | 0.312216 | 0.173844 | 0.238356 | 1.000000 |
Interpretation :
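The heat map uses Spearman correlation because rank-based correlation is insensitive to outliers, unlike Pearson. A small illustration (synthetic data, not the diabetes set):

```python
import pandas as pd

# A perfectly monotonic relationship with one extreme outlier in y.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
y = pd.Series([2, 4, 6, 8, 10, 12, 14, 500])  # last point is an outlier

print(f"Pearson : {x.corr(y):.3f}")                     # distorted by the outlier
print(f"Spearman: {x.corr(y, method='spearman'):.3f}")  # 1.0 -- ranks are unaffected
```

Since y increases monotonically with x, Spearman stays at exactly 1.0 while the Pearson coefficient is pulled well away from it by the single extreme value.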
2.9 Let's Check If the Data Is Balanced or Imbalanced
df.Outcome.value_counts(normalize=True).plot(kind='bar',color="royalblue")
plt.title("Proportion Of Outcome")
plt.show()
Interpretation :
So far, the analysis has shown the following:
Given these inferences, it is evident that tree-based ensemble techniques are a natural choice: they can handle outliers, skewed distributions, and class imbalance, and they build strong models from weak learners.
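The reasoning above can be sketched as a minimal cross-validated tree-ensemble baseline. This uses synthetic data and `RandomForestClassifier` as a stand-in; in the notebook itself, `Features`/`Target` and `LGBMClassifier` would be used instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with a similar shape/imbalance to the diabetes data.
X, y = make_classification(n_samples=768, n_features=8,
                           weights=[0.65, 0.35], random_state=101)

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=101)
model = RandomForestClassifier(n_estimators=200, random_state=101, n_jobs=-1)
scores = cross_val_score(model, X, y, cv=kf, scoring="roc_auc", n_jobs=-1)
print(f"mean ROC AUC: {scores.mean():.3f}")
```

Note that tree ensembles need no feature scaling, which is one of their practical advantages over distance-based models such as KNN.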
# Kfold
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=101)
4.1.1 Standardization
Scaling is necessary when dealing with algorithms that work on Euclidean distances, such as KNN.
As the dataset contains outliers, we will use RobustScaler, which is robust to them.
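A quick illustration of why RobustScaler (median/IQR) is preferred over StandardScaler (mean/std) when outliers are present (toy data, not the notebook's pipeline):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

std = StandardScaler().fit_transform(x)
rob = RobustScaler().fit_transform(x)

# StandardScaler: the outlier inflates the std, squashing the normal points together.
# RobustScaler: centers on the median (3) and scales by the IQR (2), so the bulk
# of the data keeps a sensible spread and only the outlier sits far away.
print(np.round(std.ravel(), 2))
print(np.round(rob.ravel(), 2))  # [-1.  -0.5  0.   0.5 48.5]
```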
Preprocessing = Pipeline(steps =[("Scaler",RobustScaler()),("KNN",KNeighborsClassifier(n_jobs=-1))])
# Separating Features And Target
Features = df.iloc[:,0:-1]
Target = df.iloc[:,-1]
# Defining Parameter Space
grid_param = {'KNN__n_neighbors': optuna.distributions.IntUniformDistribution(1,20,1),
'KNN__weights': optuna.distributions.CategoricalDistribution(['uniform','distance']),
'KNN__metric': optuna.distributions.CategoricalDistribution(['minkowski','euclidean','manhattan'])}
# Bayesian Hyperparameter Optimization Using Optuna
optuna_search = OptunaSearchCV(estimator=Preprocessing, param_distributions = grid_param,
cv = kf, n_jobs = -1, n_trials=100, random_state = 101, refit = True,
scoring = 'roc_auc', verbose = 0)
optuna_search.fit(Features, Target)
[I 2021-09-28 21:33:26,919] A new study created in memory with name: no-name-59e170ea-7485-4643-9d38-21708d01c81d
[I 2021-09-28 21:33:34,068] Trial 3 finished with value: 0.7973062678062679 and parameters: {'KNN__n_neighbors': 8, 'KNN__weights': 'distance', 'KNN__metric': 'euclidean'}. Best is trial 3 with value: 0.7973062678062679.
[I 2021-09-28 21:33:34,176] Trial 0 finished with value: 0.8200427350427351 and parameters: {'KNN__n_neighbors': 15, 'KNN__weights': 'distance', 'KNN__metric': 'manhattan'}. Best is trial 0 with value: 0.8200427350427351.
...
[I 2021-09-28 21:33:50,875] Trial 44 finished with value: 0.8253760683760684 and parameters: {'KNN__n_neighbors': 19, 'KNN__weights': 'uniform', 'KNN__metric': 'euclidean'}. Best is trial 44 with value: 0.8253760683760684.
...
[I 2021-09-28 21:34:14,214] Trial 97 finished with value: 0.8219914529914529 and parameters: {'KNN__n_neighbors': 18, 'KNN__weights': 'uniform', 'KNN__metric': 'euclidean'}. Best is trial 44 with value: 0.8253760683760684.
OptunaSearchCV(cv=StratifiedKFold(n_splits=10, random_state=101, shuffle=True),
               estimator=Pipeline(steps=[('Scaler', RobustScaler()),
                                         ('KNN',
                                          KNeighborsClassifier(n_jobs=-1))]),
               n_jobs=-1, n_trials=100,
               param_distributions={'KNN__metric': CategoricalDistribution(choices=('minkowski', 'euclidean', 'manhattan')),
                                    'KNN__n_neighbors': IntUniformDistribution(high=20, low=1, step=1),
                                    'KNN__weights': CategoricalDistribution(choices=('uniform', 'distance'))},
               random_state=101, scoring='roc_auc')
print(optuna_search.best_estimator_)
print(optuna_search.best_score_)
Pipeline(steps=[('Scaler', RobustScaler()),
                ('KNN',
                 KNeighborsClassifier(metric='euclidean', n_jobs=-1,
                                      n_neighbors=19))])
0.8253760683760684
# Study History
optuna.visualization.plot_optimization_history(optuna_search.study_)
# Plotting Best Parameters
optuna.visualization.plot_param_importances(optuna_search.study_)
# Plotting Parameter Ranges
optuna.visualization.plot_slice(optuna_search.study_)
# Running The Model With Best Parameters
best_knn_model = optuna_search.best_estimator_
# Training model
best_knn_model.fit(Features,Target)
# Calculating Cross Validation Score
Knn_predictions = model_selection.cross_val_predict(best_knn_model,Features, Target,cv = kf, n_jobs = -1)
print("\n KNN AUC_ROC Score : ", metrics.roc_auc_score(Target, Knn_predictions),"\n")
# Classification Report
print("Classification Report : \n\n" ,metrics.classification_report(Target, Knn_predictions))
KNN AUC_ROC Score : 0.7498219766728054
Classification Report :
precision recall f1-score support
0 0.80 0.87 0.83 500
1 0.70 0.59 0.64 268
accuracy 0.77 768
macro avg 0.75 0.73 0.74 768
weighted avg 0.76 0.77 0.76 768
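As a side note, `metrics.roc_auc_score` expects the true labels first and the predicted scores second; swapping them silently changes the result. A minimal sketch of the correct argument order (the toy values below are illustrative, taken from the scikit-learn documentation example, not from this dataset):

```python
from sklearn.metrics import roc_auc_score

# roc_auc_score(y_true, y_score): true labels first, scores second
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
# 3 of the 4 (positive, negative) pairs are ranked correctly -> AUC = 0.75
print(auc)  # 0.75
```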
# # Defining An Objective Function For Hyperparameter Tuning
# def objective_xgb(trial, X, y, early_stopping_rounds):
#     # Param Space
#     params = {
#         "verbosity": 0,
#         'eval_metric': 'auc',
#         # 'tree_method': 'gpu_hist',  # Use GPU acceleration
#         # 'predictor': 'gpu_predictor',
#         "seed": 101,
#         'n_jobs': -1,
#         "alpha": trial.suggest_loguniform("alpha", 1e-8, 1.0),
#         "lambda": trial.suggest_loguniform("lambda", 1e-8, 100.0),
#         "gamma": trial.suggest_loguniform("gamma", 1e-8, 100.0),
#         "colsample_bytree": trial.suggest_loguniform("colsample_bytree", 0.5, 0.8),
#         "subsample": trial.suggest_loguniform("subsample", 0.5, 0.8),
#         "learning_rate": trial.suggest_loguniform("learning_rate", 0.0001, 0.1),
#         'n_estimators': 10000,
#         'max_depth': trial.suggest_int("max_depth", 1, 80, log=True),
#         "min_child_weight": trial.suggest_loguniform("min_child_weight", 3, 100)
#     }
#     # Callback for pruning unpromising trials
#     pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "validation_0-auc")
#     model = XGBClassifier(**params)
#     Metrics = []  # To store the AUC scores
#     # Cross-validation loop:
#     for train_index, test_index in kf.split(X, y):
#         # Setting training and testing sets for each fold
#         x_train, x_test = X.iloc[train_index, :], X.iloc[test_index, :]
#         y_train, y_test = y.iloc[train_index], y.iloc[test_index]
#         # Fitting the model on each fold
#         model.fit(x_train, y_train,
#                   eval_set=[(x_test, y_test)],
#                   eval_metric='auc',
#                   verbose=0,
#                   callbacks=[pruning_callback],
#                   early_stopping_rounds=early_stopping_rounds)
#         # Prediction for each fold
#         y_pred = model.predict(x_test)
#         # Appending the AUC score for each test fold
#         Metrics.append(metrics.roc_auc_score(y_test, y_pred))
#     return np.mean(Metrics)
# # Creating Optuna Study
# study_xgb = optuna.create_study(direction="maximize")
# study_xgb.optimize(lambda trial: objective_xgb(trial, X=Features, y=Target, early_stopping_rounds=400),
#                    n_trials=200, n_jobs=-1)
# print("Best Value : ", study_xgb.best_value)
# print("\n\n Best Params : ", study_xgb.best_params)
# Storing The Best Params
best_params = {'alpha': 1.7857171526334645e-07, 'lambda': 4.969111661609928e-06, 'gamma': 7.0867106981287455e-06,
'colsample_bytree': 0.7440989368070011, 'subsample': 0.5628507665726754,
'learning_rate': 0.004267786629571522, 'max_depth': 4, 'min_child_weight': 3.082113258455785}
# Final Prediction and Validation for XGBoost
temp = {'verbosity' : 0, 'eval_metric' : 'auc',
"seed": 101,'n_jobs': -1,'n_estimators': 10000}
best_params.update(temp)
# Model with best Params
model_xgb = XGBClassifier(**best_params)
Xgb_predictions = np.zeros(len(Target.values))
Xgb_predictions = pd.DataFrame(Xgb_predictions)
# Cross_validation Loop:
for train_index, test_index in kf.split(Features, Target):
    # Setting training and testing sets for each fold
    x_train, x_test = Features.iloc[train_index, :], Features.iloc[test_index, :]
    y_train, y_test = Target.iloc[train_index], Target.iloc[test_index]
    # Fitting the model on each fold
    model_xgb.fit(x_train, y_train,
                  eval_set=[(x_test, y_test)],
                  verbose=0,
                  early_stopping_rounds=100)
    # Prediction for each fold
    Xgb_predictions.iloc[test_index, 0] = model_xgb.predict(x_test)
print("Best Number of Trees(estimators)",model_xgb.best_ntree_limit)
Best Number of Trees(estimators) 10
print("AUC_ROC Score : ",metrics.roc_auc_score(Target,Xgb_predictions))
print("\n Classification Report : \n\n",metrics.classification_report(Target,Xgb_predictions))
AUC_ROC Score : 0.719044776119403
Classification Report :
precision recall f1-score support
0 0.79 0.86 0.82 500
1 0.68 0.58 0.63 268
accuracy 0.76 768
macro avg 0.74 0.72 0.73 768
weighted avg 0.75 0.76 0.76 768
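The cross-validation loop above builds out-of-fold (OOF) predictions: each sample is predicted by a model that never saw it during training, and the per-fold predictions are stitched back together via the test indices. A minimal sketch of the same pattern, with hypothetical data and a stand-in majority-class "model" so it runs without XGBoost:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(101)
X = rng.normal(size=(20, 3))       # hypothetical features
y = np.array([0] * 13 + [1] * 7)   # imbalanced labels, as in the dataset

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=101)
oof = np.full(len(y), -1)          # -1 marks "not yet predicted"

for train_index, test_index in cv.split(X, y):
    # Stand-in for model.fit / model.predict: predict the training majority class
    majority = np.bincount(y[train_index]).argmax()
    oof[test_index] = majority

# Every sample receives exactly one out-of-fold prediction
print((oof != -1).all())  # True
```

Because each test fold is disjoint and the folds together cover the whole dataset, `oof` ends up with one honest prediction per row, which is what the classification reports above are computed from.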
# # Defining An Objective Function For Hyperparameter Tuning
# def objective_lgbm(trial, X, y, early_stopping_rounds):
#     # Param Space
#     params = {
#         "seed": 101,
#         # "verbosity": -1,
#         "save_binary": True,
#         "num_threads": 4,
#         "boosting": "gbdt",
#         "extra_trees": True,
#         "metric": "auc",
#         # "xgboost_dart_mode": trial.suggest_categorical("xgboost_dart_mode", [True, False]),
#         "is_unbalance": True,
#         # "device_type": trial.suggest_categorical("device_type", ['gpu']),
#         "n_estimators": 10000,
#         "learning_rate": trial.suggest_loguniform("learning_rate", 0.1, 1.0),
#         "num_leaves": trial.suggest_int("num_leaves", 20, 2500, step=10),
#         "max_depth": trial.suggest_int("max_depth", 2, 20, step=1),
#         "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100, step=2),
#         "lambda_l1": trial.suggest_float("lambda_l1", 0.00001, 4, step=0.1),
#         "lambda_l2": trial.suggest_float("lambda_l2", 0.001, 20.0, step=0.1),
#         "min_gain_to_split": trial.suggest_float("min_gain_to_split", 1, 9),
#         "bagging_fraction": 0.95,
#         "feature_fraction": 0.95,
#         "bagging_freq": 5,
#         "bagging_seed": 101
#     }
#     # Callback for pruning unpromising trials
#     pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "auc")
#     model = LGBMClassifier(**params, objective="binary", silent=True, verbose=-100)
#     Metrics = []  # To store the AUC scores
#     # Cross-validation loop:
#     for train_index, test_index in kf.split(X, y):
#         # Setting training and testing sets for each fold
#         x_train, x_test = X.iloc[train_index, :], X.iloc[test_index, :]
#         y_train, y_test = y.iloc[train_index], y.iloc[test_index]
#         # Fitting the model on each fold
#         model.fit(x_train, y_train,
#                   eval_set=[(x_test, y_test)],
#                   eval_metric="auc",
#                   verbose=False,
#                   callbacks=[pruning_callback],
#                   early_stopping_rounds=early_stopping_rounds)
#         # Prediction for each fold
#         y_pred = model.predict(x_test)
#         # Appending the AUC score for each test fold
#         Metrics.append(metrics.roc_auc_score(y_test, y_pred))
#     return np.mean(Metrics)
# # Creating Optuna Study
# study_lgbm = optuna.create_study(direction="maximize")
# study_lgbm.optimize(lambda trial: objective_lgbm(trial, X=Features, y=Target, early_stopping_rounds=400),
#                     n_trials=200, n_jobs=-1)
best_params = {'learning_rate': 0.4107999846346181, 'num_leaves': 2450, 'max_depth': 11, 'min_data_in_leaf': 11,
'lambda_l1': 0.20001000000000002, 'lambda_l2': 0.901, 'min_gain_to_split': 1.5352032889321992}
# Final Prediction and Validation for LightGBM
temp = {'verbosity' : -100, 'metric' : 'auc', "extra_trees":True,
"seed": 101, "is_unbalance":True,'num_threads': 4 , 'n_estimators': 10000,
"save_binary":True, "bagging_fraction" : 0.95, "feature_fraction" : 0.95,
"bagging_freq" : 5, "bagging_seed" : 101}
best_params.update(temp)
# Model with best Params
model_lgbm = LGBMClassifier(**best_params, objective="binary",silent=True)
lgbm_predictions = np.zeros(len(Target.values))
lgbm_predictions = pd.DataFrame(lgbm_predictions)
# Cross_validation Loop:
for train_index, test_index in kf.split(Features, Target):
    # Setting training and testing sets for each fold
    x_train, x_test = Features.iloc[train_index, :], Features.iloc[test_index, :]
    y_train, y_test = Target.iloc[train_index], Target.iloc[test_index]
    # Fitting the model on each fold
    model_lgbm.fit(x_train, y_train,
                   eval_set=[(x_test, y_test)],
                   eval_metric="auc",
                   verbose=False,
                   early_stopping_rounds=400)
    # Prediction for each fold
    lgbm_predictions.iloc[test_index, 0] = model_lgbm.predict(x_test)
print("\n\n Light GBM AUC_ROC Score : ", metrics.roc_auc_score(Target,lgbm_predictions))
print("\n Classification Report : \n\n",metrics.classification_report(Target,lgbm_predictions))
[LightGBM] [Warning] feature_fraction is set=0.95, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.95
[LightGBM] [Warning] min_data_in_leaf is set=11, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=11
[LightGBM] [Warning] min_gain_to_split is set=1.5352032889321992, min_split_gain=0.0 will be ignored. Current value: min_gain_to_split=1.5352032889321992
[LightGBM] [Warning] lambda_l1 is set=0.20001000000000002, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.20001000000000002
[LightGBM] [Warning] bagging_fraction is set=0.95, subsample=1.0 will be ignored. Current value: bagging_fraction=0.95
[LightGBM] [Warning] num_threads is set=4, n_jobs=-1 will be ignored. Current value: num_threads=4
[LightGBM] [Warning] lambda_l2 is set=0.901, reg_lambda=0.0 will be ignored. Current value: lambda_l2=0.901
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
Light GBM AUC_ROC Score : 0.7861194029850747
Classification Report :
precision recall f1-score support
0 0.88 0.77 0.82 500
1 0.65 0.80 0.72 268
accuracy 0.78 768
macro avg 0.77 0.79 0.77 768
weighted avg 0.80 0.78 0.79 768
print("Best Number of Trees(estimators)", model_lgbm.best_iteration_)
Best Number of Trees(estimators) 78
# %%time
# grid_param = {'n_estimators': optuna.distributions.IntUniformDistribution(220,3000,100),
# 'max_features': optuna.distributions.CategoricalDistribution(['sqrt','auto','log2']),
# "class_weight" : optuna.distributions.CategoricalDistribution(['balanced_subsample']),
# "bootstrap": optuna.distributions.CategoricalDistribution([True,False]),
# 'max_depth': optuna.distributions.IntLogUniformDistribution(2,50,1),
# 'min_samples_split': optuna.distributions.IntUniformDistribution(2,10),
# 'min_samples_leaf': optuna.distributions.IntUniformDistribution(1,10),
# 'criterion': optuna.distributions.CategoricalDistribution(["gini", "entropy"])}
# optuna_search = OptunaSearchCV(estimator=RandomForestClassifier(random_state=101, n_jobs=-1,warm_start=True), param_distributions = grid_param,
# cv = kf, n_jobs = -1, n_trials=100, random_state = 101, refit = True,
# scoring = 'roc_auc', verbose = 3)
# optuna_search.fit(Features, Target)
# print(optuna_search.best_estimator_)
# print(optuna_search.best_score_)
#Creating Model Object
rfModel = RandomForestClassifier(bootstrap=False, class_weight='balanced_subsample',
criterion='entropy', max_depth=5, max_features='sqrt',
min_samples_leaf=8, min_samples_split=6,
n_estimators=320, n_jobs=-1, random_state=101,
warm_start=True)
rfModel.fit(Features,Target)
rfModel_predictions = model_selection.cross_val_predict(rfModel,Features,Target,cv=kf,n_jobs=-1)
# Cross-Validation Score (AUC)
print("\n\n Random Forest AUC_ROC Score : ", metrics.roc_auc_score(Target,rfModel_predictions))
print("\n Classification Report : \n\n",metrics.classification_report(Target,rfModel_predictions))
Random Forest AUC_ROC Score : 0.7699850746268657
Classification Report :
precision recall f1-score support
0 0.88 0.73 0.80 500
1 0.62 0.81 0.70 268
accuracy 0.76 768
macro avg 0.75 0.77 0.75 768
weighted avg 0.79 0.76 0.76 768
Sensitivity
Sensitivity, a.k.a. True Positive Rate (Recall): out of all patients who actually have diabetes, how many did we correctly predict as diabetic?
Precision: out of all patients predicted to be diabetic, how many actually have diabetes?
Specificity
Specificity, a.k.a. True Negative Rate: out of all patients who do not have diabetes, how many did we correctly predict as non-diabetic?
For this problem, a good model is one whose sensitivity is at least as high as its specificity.
Given the problem statement and the data, we take the following points into account when evaluating a model:
False Negatives
i.e. patients incorrectly classified as non-diabetic. This is critical: these patients actually have diabetes but are predicted not to, so they will not receive treatment, and the consequences can be fatal for the patients and costly for the healthcare provider.
False Positives
i.e. patients incorrectly classified as diabetic. This is also critical: these patients do not have diabetes but are predicted to, so they will be treated as diabetic, and the medication can cause serious side effects.
Thus false positives and false negatives are equally critical, i.e. precision and recall are equally important.
So we measure model performance with the F-beta score (with beta = 1, i.e. the F1-score, since precision and recall are weighted equally).
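These quantities follow directly from the confusion-matrix counts. A small pure-Python sketch with made-up counts (TP, FN, FP, TN below are hypothetical, not taken from the models above); with beta = 1 the F-beta score reduces to the F1-score:

```python
# Hypothetical confusion-matrix counts (illustrative only)
TP, FN, FP, TN = 40, 10, 20, 30

sensitivity = TP / (TP + FN)   # recall / true positive rate -> 0.8
specificity = TN / (TN + FP)   # true negative rate          -> 0.6
precision   = TP / (TP + FP)   # positive predictive value

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(sensitivity, 3))                      # 0.8
print(round(specificity, 3))                      # 0.6
print(round(f_beta(precision, sensitivity), 3))   # 0.727 (the F1-score)
```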
5.1 Storing The F1-Score of All Models
# F1 Scores
Knn_F1_Score = metrics.f1_score(Target,Knn_predictions)
XGB_F1_Score = metrics.f1_score(Target,Xgb_predictions)
LGBM_F1_Score = metrics.f1_score(Target,lgbm_predictions)
RF_F1_Score = metrics.f1_score(Target,rfModel_predictions)
5.2 Classification Reports of All Models
print("Classification Report KNN: \n\n", metrics.classification_report(Target,Knn_predictions),end="")
print("\n\nClassification Report XGB: \n\n", metrics.classification_report(Target,Xgb_predictions))
print("\n\nClassification Report LGBM: \n\n", metrics.classification_report(Target,lgbm_predictions))
print("\n\nClassification Report RF: \n\n", metrics.classification_report(Target,rfModel_predictions))
Classification Report KNN:
precision recall f1-score support
0 0.80 0.87 0.83 500
1 0.70 0.59 0.64 268
accuracy 0.77 768
macro avg 0.75 0.73 0.74 768
weighted avg 0.76 0.77 0.76 768
Classification Report XGB:
precision recall f1-score support
0 0.79 0.86 0.82 500
1 0.68 0.58 0.63 268
accuracy 0.76 768
macro avg 0.74 0.72 0.73 768
weighted avg 0.75 0.76 0.76 768
Classification Report LGBM:
precision recall f1-score support
0 0.88 0.77 0.82 500
1 0.65 0.80 0.72 268
accuracy 0.78 768
macro avg 0.77 0.79 0.77 768
weighted avg 0.80 0.78 0.79 768
Classification Report RF:
precision recall f1-score support
0 0.88 0.73 0.80 500
1 0.62 0.81 0.70 268
accuracy 0.76 768
macro avg 0.75 0.77 0.75 768
weighted avg 0.79 0.76 0.76 768
# Creating A Data Frame With Models And Their F1-Scores
(pd.DataFrame({"Model":["KNN", "XGboost", "LGBM", "Random_Forest"],
"F1-Score" : [Knn_F1_Score, XGB_F1_Score, LGBM_F1_Score, RF_F1_Score]})).sort_values(by="F1-Score", ascending=False)
|   | Model | F1-Score |
|---|---|---|
| 2 | LGBM | 0.719064 |
| 3 | Random_Forest | 0.700162 |
| 0 | KNN | 0.640974 |
| 1 | XGboost | 0.629032 |
Conclusion
The best-performing model in terms of F1-score is LGBM; its is_unbalance option lets it handle the imbalanced dataset better than the other models tried here.
KNN turned out to be the worst of all; as a simple distance-based method it struggles on this data, where the classes overlap heavily and the positive class is under-represented.